Project - Python For Data Analysis 🐍


Quantified drug consumption 💊

The objective of this project is to carry out a Data Science project on an assigned dataset. We work with the following database: Drug Consumption (Quantified), from the UCI Machine Learning Repository.

It is derived from an online survey conducted between 2011 and 2012 among 1885 respondents aged 18 years and older from English-speaking countries. It collects demographic information, the scores of three personality tests (NEO-FFI-R, BIS11 and ImpSS), and the use of 19 central nervous system psychoactive drugs, each reported as one of the following answers:

• Never Used
• Used over a Decade Ago
• Used in Last Decade
• Used in Last Year
• Used in Last Month
• Used in Last Week
• Used in Last Day
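In the UCI release, these seven answers are stored as the codes CL0 to CL6. The lookup table below is a minimal sketch assuming that coding (worth checking against the dataset description); the helper name `describe_usage` is hypothetical.

```python
# Assumed coding from the UCI dataset description: CL0 (never) to CL6 (last day).
USAGE_LABELS = {
    "CL0": "Never Used",
    "CL1": "Used over a Decade Ago",
    "CL2": "Used in Last Decade",
    "CL3": "Used in Last Year",
    "CL4": "Used in Last Month",
    "CL5": "Used in Last Week",
    "CL6": "Used in Last Day",
}


def describe_usage(code: str) -> str:
    """Return the human-readable answer for a usage code (hypothetical helper)."""
    return USAGE_LABELS[code]
```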

The authors of the survey showed that there is a relationship between the risk of drug addiction and personality attributes.

Problem to solve:

From this dataset, we chose to address the following question:

How can we model the risk of addiction to a drug based on personality test scores?

Set a measurable objective for our project:

Table of Contents

Requirements

Import data

Data Visualization


⚠️ All variables have already been preprocessed (encoded), which makes visualization more complicated and requires the creation of a second, decoded dataset.

We have no information on the real values behind the SS and Impulsive columns.
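Building a readable copy of an encoded column can be sketched as follows. This assumes that the encoded values of an ordinal variable (age band, for example) sort in the same order as its categories; the helper name `decode_ordered` and the toy values are illustrative, not taken from the dataset.

```python
import pandas as pd


def decode_ordered(column: pd.Series, labels: list) -> pd.Series:
    """Map the sorted distinct encoded values of a column onto ordered labels."""
    codes = sorted(column.unique())
    if len(codes) != len(labels):
        raise ValueError("number of labels must match number of distinct codes")
    return column.map(dict(zip(codes, labels)))


# Toy encoded values, for illustration only (not the real dataset).
encoded = pd.Series([-0.95, -0.08, 0.50, -0.95])
decoded = decode_ordered(encoded, ["18-24", "25-34", "35-44"])
```

Columns whose original scale is unknown (SS and Impulsive) cannot be decoded this way and are better left numeric.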

Global analysis of the dataset

Exploration of all columns

ID column

All ID values are unique

Features

As mentioned in the description of the dataset, we have:
Demographic information

Psychological test scores

As noted above, all the columns of the dataset have already been preprocessed (encoded).

Age

The majority of respondents are between 18 and 34 years old (59.63%), which is quite young.
There is a difference between the genders of the respondents, with women being older than men on average.

Gender

There is no gender imbalance: the split is almost perfectly even.

Equal gender distribution

Education

The majority of people have been to college (86.36%).
Women are on average more educated than men.

Country

The majority of respondents are from the US and UK (84.93%).
The majority of American respondents are men, while the majority of British respondents are women.

Ethnicity

An overwhelming majority of respondents are white (91.25%), so the survey lacks diversity.
The gender distribution is equal.
The majority of white respondents come from the UK.

NEO-FFI-R score

The scores are approximately normally distributed.

BIS11

We don't have the real values associated with this test.
If we assume that a negative value means a low impulsivity score, the split is roughly even: 50.82% have a low impulsivity score.
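The share of "low" scores under this assumption is simply the proportion of encoded values below zero. A minimal sketch on toy values (the function name `share_below_zero` is hypothetical):

```python
def share_below_zero(scores):
    """Percentage of scores strictly below zero."""
    return 100 * sum(1 for s in scores if s < 0) / len(scores)


# Toy encoded scores, for illustration only.
toy_scores = [-1.2, -0.3, 0.4, 0.9]
low_share = share_below_zero(toy_scores)
```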

ImpSS

We don't have the real values associated with this test.
If we assume that a negative value means a low score, the split is roughly even: 47.38% have a low score.

Target

There are 19 drugs, which can be divided into the following categories:
Common

Substances diverted into drugs

Legal

Illegal

Fictional

Distribution of all classes for each drug

  1. Drugs that most people have used (<5% never used)
    • Alcohol
    • Caff
    • Choc
  2. Drugs that at least 20% of people have never used (>20% never used)
    • Nicotine
    • Cannabis
  3. Drugs that at least 50% of people have never used (>50% never used)
    • Amphet
    • Benzos
    • Coke
    • Esctasy
    • Mushrooms
    • LSD
  4. Drugs that most people never used (>70% never used)
    • Amyl
    • Crack
    • Heroin
    • Ketamine
    • Meth
    • Semer
    • VSA
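The grouping above rests on the share of "never used" answers per drug. A minimal sketch of that computation, assuming the raw answers are coded CL0 to CL6 with CL0 meaning "never used"; the data frame below is a toy stand-in, not the real dataset.

```python
import pandas as pd


def never_used_share(df: pd.DataFrame, drug_columns: list) -> pd.Series:
    """Percentage of respondents answering CL0 (never used), per drug."""
    return df[drug_columns].eq("CL0").mean().mul(100).sort_values()


# Toy answers, for illustration only.
toy = pd.DataFrame({
    "Alcohol": ["CL5", "CL6", "CL4", "CL0"],
    "Heroin": ["CL0", "CL0", "CL0", "CL2"],
})
shares = never_used_share(toy, ["Alcohol", "Heroin"])
```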

Distribution of the Used and Not used classes for each drug

We can observe that:

Let's look at the number of drugs used per person

We can divide the dataset into two groups:

Our Assumptions about features' influence on the target

We will verify these assumptions with the relationships between drug used and features

Our assumptions seem to hold

Missing values

There are no missing values
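A check of this kind can be sketched with pandas as follows; the frame is a toy stand-in for the dataset.

```python
import pandas as pd

# Toy frame, for illustration only.
toy = pd.DataFrame({"Age": [0.5, -0.9, 1.1], "Nscore": [0.3, -0.1, 1.2]})

missing_per_column = toy.isna().sum()        # number of NaN values per column
total_missing = int(missing_per_column.sum())
```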

Relationships

Between Features

Highest correlations:

Features/Targets

Let's now compute an indicator for whether a respondent uses at least one drug, and recompute the correlation matrix
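One way to build such an indicator is sketched below. The definition of "used" is an assumption here (answers CL3 to CL6, that is, use within the last year); the source does not state the threshold, and the data are toy values.

```python
import pandas as pd

# Assumption: a respondent counts as a user if the answer is CL3 or more recent.
USED_CODES = ["CL3", "CL4", "CL5", "CL6"]


def uses_any_drug(df: pd.DataFrame, drug_columns: list) -> pd.Series:
    """1 if the respondent is a user of at least one listed drug, else 0."""
    return df[drug_columns].isin(USED_CODES).any(axis=1).astype(int)


# Toy data, for illustration only.
toy = pd.DataFrame({
    "Nscore": [0.1, 1.2, -0.5, 0.8],
    "Cannabis": ["CL0", "CL5", "CL1", "CL3"],
    "Coke": ["CL0", "CL0", "CL2", "CL0"],
})
any_drug = uses_any_drug(toy, ["Cannabis", "Coke"])
corr = toy["Nscore"].corr(any_drug)  # Pearson correlation with the indicator
```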

Our Assumptions

We will verify these assumptions with the relationships between drug used and features

Age - Drug used
Gender - Drug used
Education - Drug used
Country - Drug used
Ethnicity - Drug used

Between targets

PCA

We tried PCA, but it did not improve the scores.
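For reference, a PCA step of this kind can be sketched with scikit-learn as follows. The configuration (standardization, 95% of variance retained) is an assumption, since the settings actually tried are not described; the data are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy stand-in for the feature matrix

# Standardize first so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance (assumed setting).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()
```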

Preprocessing


  1. Target selection
    Drop the drugs that will not be used for modelling
  2. Target encoding
    We use label encoding
  3. Feature selection
    Drop the features that will not be used for modelling
  4. Feature encoding
    We use one-hot encoding
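The four steps can be sketched as follows. The column choices are hypothetical and the frame is a toy stand-in: string categories are used for readability, whereas the real dataset is already numerically encoded.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame, for illustration only.
toy = pd.DataFrame({
    "Country": ["UK", "USA", "UK", "Other"],
    "Nscore": [0.3, -0.1, 1.2, 0.0],
    "Choc": ["CL5", "CL6", "CL4", "CL5"],
    "Cannabis": ["CL0", "CL3", "CL0", "CL6"],
})

# 1. Target selection: drop the drugs that will not be modelled (here Choc).
targets = toy[["Cannabis"]]

# 2. Target encoding: label encoding maps sorted class names to 0, 1, 2, ...
y = LabelEncoder().fit_transform(targets["Cannabis"])

# 3. Feature selection: keep only the features used for modelling.
features = toy[["Country", "Nscore"]]

# 4. Feature encoding: one-hot encode the categorical feature.
X = pd.get_dummies(features, columns=["Country"])
```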

Modeling


Metrics

Phases

Initial classes

Basic implementation

First metric: Accuracy

Accuracy is not the best metric in this case, as we can see here

Second metric: Balanced accuracy (class sizes taken into account)
Third metric: Confusion matrix
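The difference between the three metrics shows up clearly on a toy imbalanced example: a model that always predicts the majority class gets a high accuracy but a balanced accuracy of only 0.5. The labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Toy labels: 9 users (1), 1 non-user (0); the model always predicts "user".
y_true = [1] * 9 + [0]
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
cm = confusion_matrix(y_true, y_pred)              # rows: true class, columns: predicted
```

Here the confusion matrix makes the failure visible: the single non-user is misclassified, which accuracy alone hides.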

Example: Alcohol

Weighting (balanced classes)

In principle, unbalanced classes are not a problem for the k-nearest neighbors algorithm.

Because the algorithm has no training step that depends on class size, it does not explicitly favor any class on that basis. In practice, however, a majority class can still dominate the vote among the nearest neighbors.
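For models that do accept class weights, a common choice is the "balanced" heuristic, where each class is weighted by n_samples / (n_classes * class_count); this is the formula behind `class_weight="balanced"` in scikit-learn. Whether this exact scheme was used here is an assumption. A minimal sketch on toy labels:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels, for illustration only: 2 users, 8 non-users.
toy_labels = ["user"] * 2 + ["non-user"] * 8
weights = balanced_class_weights(toy_labels)
```

The rare class receives the larger weight, so its errors count more during training.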

Example for alcohol

Over sampling
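Random oversampling duplicates samples from the smaller classes until every class is as large as the biggest one. The sketch below uses only the standard library and is an illustration of the idea, not necessarily the implementation used in this project; `random_oversample` is a hypothetical helper.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random minority samples until all classes have equal size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        indices = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            i = rng.choice(indices)  # draw with replacement from this class
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out


# Toy data, for illustration only.
X_toy = [[0.1], [0.2], [0.3], [0.9]]
y_toy = ["non-user", "non-user", "non-user", "user"]
X_res, y_res = random_oversample(X_toy, y_toy)
```

Oversampling should be applied to the training split only, so that duplicated samples do not leak into the evaluation data.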

Example Alcohol

New classes

We delete chocolate from the dataset

Based on the confusion matrix, we create new classes depending on the drug
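Merging the original answers into coarser classes can be sketched as below. The grouping shown (CL0 and CL1 as "non-user", the rest as "user") is purely hypothetical; the actual groupings chosen for each drug are not given here.

```python
def merge_classes(codes, groups, default="other"):
    """Replace each original code by the name of the group that contains it."""
    lookup = {code: label for label, members in groups.items() for code in members}
    return [lookup.get(code, default) for code in codes]


# Hypothetical grouping, for illustration only.
example_groups = {
    "non-user": {"CL0", "CL1"},
    "user": {"CL2", "CL3", "CL4", "CL5", "CL6"},
}
merged = merge_classes(["CL0", "CL5", "CL1", "CL3"], example_groups)
```

Passing a different `groups` dictionary per drug allows the class definitions to vary from one drug to another.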

Basic implementation

Example: Alcohol

Weighting (balanced classes)

Example for alcohol

Over sampling

Example Alcohol

Comparison

Thus, weighting is the best method for all drugs except Mushrooms, where the difference is small.
Let's now analyse the confusion matrix of the best model for each drug to verify the results

Comparison of each method using confusion matrices

Confusion matrix for all drugs

Tuning hyperparameters

As noted above, we use only the weighting technique here because it is the most effective method.
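A tuning loop of this kind can be sketched with `GridSearchCV`. The estimator (logistic regression), the grid, and the synthetic data are all assumptions made for illustration; the models and grids actually tuned per drug are not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (roughly 80% / 20%), for illustration only.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.8, 0.2], random_state=0
)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),  # weighting technique
    param_grid,
    scoring="balanced_accuracy",  # same metric family as used for the comparison
    cv=5,
)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_
```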

Alcohol
Amphet
Amyl
Benzos
Caff
Cannabis
Coke
Crack
Esctasy
Heroin
Ketamine
Legalh
LSD
Meth
Mushrooms
Nicotine
VSA
Save data
Comparison

Final models for each drug


Alcohol

Amphet

Amyl

Benzos

Caff

Cannabis

Coke

Crack

Esctasy

Heroin

Ketamine

Legalh

LSD

Meth

Mushrooms

Nicotine

VSA

ALL

Save models

We save the models for the Django API
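One possible way to persist one model per drug is sketched below with the standard `pickle` module. The file layout, the helper names `save_models` and `load_model`, and the placeholder "model" are assumptions; the serialization format actually used for the API is not specified here. Fitted scikit-learn estimators can be pickled in the same way.

```python
import pickle
import tempfile
from pathlib import Path


def save_models(models, directory):
    """Write one pickle file per drug and return the paths (hypothetical layout)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = {}
    for drug, model in models.items():
        path = directory / f"{drug.lower()}.pkl"
        with path.open("wb") as f:
            pickle.dump(model, f)
        paths[drug] = path
    return paths


def load_model(path):
    """Load a model previously written by save_models."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Round trip in a temporary directory, with a placeholder standing in for a model.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_models({"Alcohol": {"C": 1.0}}, tmp)
    restored = load_model(saved["Alcohol"])
```

Pickle files should only be loaded from trusted sources, and with the same library versions that produced them.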